Binary Classification using Logistic Regression with numpy

by Pritish Jadhav, Mrunal Jadhav - Sun, 08 Jul 2018
Tags: #python #logistic regression #machine learning #classification #supervised learning

Problem Definition -

  • Suppose you are an auto enthusiast. You have been capturing pictures of cars for a long time and putting them in an album labelled 'Cars'. So far, the sorting of the photos has been done manually.
  • To manage your library easily, you would like to automate this process: each image captured by the camera is scanned and, if it is an image of a car, it is automatically organised under the label 'Cars'.

Classification Algorithms -

  • Classification is a branch of supervised machine learning algorithms where target values are discrete.
  • In a binary classification problem, the output is 1 or 0 (Car or not Car). This is represented by a Bernoulli random variable: if an experiment succeeds with probability P, the outcome takes one of 2 values, 1 for success and 0 otherwise. Thus the output is bounded on both ends.

So how do we solve classification problems?

Logistic Regression

Logistic Regression is a classification algorithm for linearly separable data points with a dichotomous (binary) outcome. It works by learning the function P(Y|X),
where Y is the indicator variable and X are the inputs.

Given a data point (x, y), logistic regression models P(Y=1|X=x).

We can say,
P(Y=1|X=x) = E(Y|X=x) (the conditional expectation of the indicator variable)

Mathematically this is given by $P(Y=1\mid X=x) = \sigma(z)$, where z is given by $z = \theta_0 + \sum_{i=1}^{m}\theta_i x_i$.

1. The Sigmoid Function

Logistic Regression starts by calculating the odds ratio.

Odds Ratio - The Odds Ratio (OR) is defined as the ratio of the probability of success to the probability of failure. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.

The odds ratio is calculated from the probability and, unlike the probability, it can range from 0 to infinity: $\text{Odds Ratio} = \frac{P}{1-P}$

Naturally, the more likely it is for the positive event P(Y=1|X=x) to occur, the larger is the odds ratio.

Now, if we take the log of the odds ratio, we get the logit function.

The logit function is monotonic: the higher the log of the odds ratio, the higher the odds ratio.

  • It is usually difficult to model a variable that has a restricted range, such as a probability. This transformation is an attempt to get around the restricted-range problem: it maps a probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. $\text{logit}(P) = \log\left(\frac{P}{1-P}\right) = \log(P) - \log(1-P)$

Therefore, $\text{logit}(P) = \theta_0 + \sum_{i=1}^{m}\theta_i x_i$
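As a quick illustrative sketch (the probabilities below are made up for demonstration and are not part of the original post), the mapping from probability to odds to log odds can be checked with a few lines of numpy:

import numpy as np

# a few example probabilities between 0 and 1 (illustrative values only)
p = np.array([0.1, 0.5, 0.8, 0.99])

odds = p / (1 - p)       # odds: range from 0 to infinity
log_odds = np.log(odds)  # logit: ranges from -infinity to +infinity

print(odds)              # approx. [0.11, 1.0, 4.0, 99.0]
print(log_odds)          # approx. [-2.20, 0.0, 1.39, 4.60]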

[Figure: plot of the logit function]

Now, taking the inverse logit on both sides, we get the sigmoid function:
$\text{logit}^{-1}(\text{logit}(P)) = \text{logit}^{-1}\left(\theta_0 + \sum_{i=1}^{m}\theta_i x_i\right)$
$P = \frac{1}{1+e^{-(\theta_0 + \sum_{i=1}^{m}\theta_i x_i)}}$
$P = \frac{1}{1+e^{-z}}$

[Figure: plot of the sigmoid function]


$P = \text{sigmoid}(z) = \sigma(z)$

In Logistic Regression, we use Sigmoid Function to map the real value to a value between 0 and 1. In other words, it maps prediction to probability.

To summarize the previous section:

$P(Y=1\mid X=x) = \text{sigmoid}(z) = \sigma(z)$

where,
$z = \theta_0 + \sum_{i=1}^{m}\theta_i x_i = \theta_0 x_0 + \sum_{i=1}^{m}\theta_i x_i \quad (x_0 = 1)$

Therefore,
$z = \sum_{i=0}^{m}\theta_i x_i$
Vectorizing the above equation, we can write
$P(Y=1\mid X=x) = \sigma(\theta^T x)$
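As a minimal sketch (with made-up numbers, not part of the original derivation), the vectorized form $\sigma(\theta^T x)$ translates directly to numpy:

import numpy as np

def sigmoid(z):
    # maps any real value to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# illustrative values: 3 features plus the bias term x0 = 1
theta = np.array([0.5, -1.2, 0.8, 2.0])  # [theta_0, theta_1, theta_2, theta_3]
x = np.array([1.0, 0.3, 1.5, -0.7])      # [x_0 = 1, x_1, x_2, x_3]

z = np.dot(theta, x)                     # theta^T x
print(sigmoid(z))                        # P(Y=1 | X=x), a value between 0 and 1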

2. Deriving the Cost Function

The probabilities in logistic regression are parametrized by θ. Our goal is to estimate the values of these parameters. We estimate them using the Maximum Likelihood Estimator (MLE).
We perform this in 2 steps:

  1. Write the likelihood Function

    The likelihood function measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters.

  2. Find the values of the parameters that maximise the likelihood function

Step 1

The labels that we are predicting are binary, and the output of our logistic regression function is supposed to be the probability that the label is one. This means that we can (and should) interpret each label as a Bernoulli random variable: Y ∼ Ber(P) where P = σ(θ^Tx).

The probability can be written as,
$P(Y=1\mid X=x) = \sigma(\theta^T x)$

By the laws of probability, $P(Y=0\mid X=x) = 1 - \sigma(\theta^T x)$.
The compact way of writing these equations for a data point (x, y) is:

$P(Y=y\mid X=x) = \sigma(\theta^T x)^{y} \cdot \big(1 - \sigma(\theta^T x)\big)^{(1-y)}$


This is also the probability mass function of a Bernoulli. Now that we know the probability mass function we can write the likelihood of the data.

$L(\theta) = \prod_{i=1}^{m} P(Y=y^{(i)} \mid X=x^{(i)})$
Substituting the likelihood of the Bernoulli,
$L(\theta) = \prod_{i=1}^{m} \sigma(\theta^T x^{(i)})^{y^{(i)}} \cdot \big(1 - \sigma(\theta^T x^{(i)})\big)^{(1-y^{(i)})}$
Taking the log of the function, we get the log likelihood,

$LL(\theta) = \sum_{i=1}^{m} y^{(i)}\log \sigma(\theta^T x^{(i)}) + (1-y^{(i)})\log\big(1 - \sigma(\theta^T x^{(i)})\big)$
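As an illustrative sketch (toy numbers that are not part of the derivation), the log likelihood for a small batch can be computed directly from this formula:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 examples, a bias column of ones plus 2 features
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -0.3,  0.8],
              [1.0,  1.5, -0.4],
              [1.0,  0.2,  0.1]])
y = np.array([1, 0, 1, 0])
theta = np.array([0.1, 0.7, -0.2])  # [theta_0, theta_1, theta_2]

p = sigmoid(X.dot(theta))           # sigma(theta^T x) for every example
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)               # a single negative number; closer to 0 is better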

Step 2

Now that we have the likelihood function, we need to choose the values of theta that maximise it. We can find the best values of theta by using an optimization algorithm such as Gradient Descent. In the optimization algorithm, we calculate the partial derivative of the log likelihood with respect to each parameter.

  • Gradient Descent works by minimising a function. Minimising the negative log likelihood is equivalent to maximising the log likelihood.
  • The 1/m factor "averages" the loss over the number of training examples so that the number of examples doesn't affect the scale of the cost.

Therefore, the cost function is as follows:
$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)}\log \sigma(\theta^T x^{(i)}) + (1-y^{(i)})\log\big(1 - \sigma(\theta^T x^{(i)})\big)$

Here are the partial derivatives, which we will derive later:

$A = \sigma(\theta^T x)$
$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}(A - Y)^T X$
$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(A^{(i)} - Y^{(i)}\big)$
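As a quick sanity check (a sketch that is not in the original post), the analytic gradient above can be compared against a finite-difference approximation on toy data:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # negative average log likelihood
    p = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # (1/m) * X^T (A - Y), matching the formula above
    m = X.shape[0]
    return X.T.dot(sigmoid(X.dot(theta)) - y) / m

# toy data (illustrative only)
rng = np.random.RandomState(0)
X = np.hstack([np.ones((5, 1)), rng.randn(5, 2)])  # bias column + 2 features
y = np.array([1, 0, 1, 1, 0])
theta = rng.randn(3)

# central finite-difference gradient
eps = 1e-6
numeric = np.array([(cost(theta + eps * np.eye(3)[j], X, y) -
                     cost(theta - eps * np.eye(3)[j], X, y)) / (2 * eps)
                    for j in range(3)])
print(np.allclose(numeric, analytic_grad(theta, X, y), atol=1e-6))  # expect True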

Let's start by importing the Python libraries that will help us accomplish the task.

In [1]:
## python libraries
import pandas as pd
import numpy as np

## import sklearn for loading the breast cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

from IPython.core.display import display,HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML("<style>.container { width:100% !important; }</style>"))

Notations that we will be using throughout are as follows -

X - Input data

Y - Target labels

W - weights vector (to be estimated using training data)

b - intercept for the decision boundary

Notations for tracking dimensions are as follows -

m - number of training examples

n - number of input features

k - size of the labels vector (in case of binary classification, k = 1)

Load Cancer dataset provided by sklearn

In [2]:
def load_breast_cancer_dataset():
    '''
    This function loads the breast cancer dataset and returns the
    feature matrix X and the label vector Y.
    '''
    data_dict = load_breast_cancer()
    X, Y = data_dict['data'], data_dict['target']
    Y = Y.reshape(len(Y), 1)  # reshape the label vector so that it has a dimension of m x k
    return X, Y
In [3]:
X, Y = load_breast_cancer_dataset()

print "successfully loaded training data with %d training data and %d features"%(X.shape[0], X.shape[1])
successfully loaded training data with 569 training data and 30 features

Leverage the built-in normalization techniques provided by sklearn.

In [4]:
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

print X.shape, Y.shape
(569, 30) (569, 1)

While implementing Logistic Regression from scratch, it is important to keep track of dimensions.

For this tutorial, we shall use the following dimensions -

X = m × n

Y = m × 1

W = 1 × n

b = 1 × 1

The pseudo-code for training a Logistic Regression model is as follows -

Step 1 - Given a Training DataSet X and corresponding labels Y, initialize values of number of training examples(m), number of features(n) and number of output units (k) as mentioned in equations 1, 2, 3 and 4

Step 2 - Initialize Weights matrix W and bias b using dimensions from Step 1.

for i_epoch in range(iterations):

Step 3 - Forward Propagation -
Step 3.1 - Compute preactivations using equation 5:

pre_activations: $Z = XW^T + b$

Let's validate that the dimensions match: $Z = (m \times n) \times (n \times 1) + (1 \times 1) = (m \times 1)$, where the bias b is broadcast across the m rows (a quick numpy shape check is sketched below).

We are good to move ahead.
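As a throwaway numpy sketch (shapes only, using the dimensions of this dataset), the dimension check can be reproduced as follows:

import numpy as np

m, n, k = 569, 30, 1        # dimensions used in this tutorial
X = np.zeros((m, n))        # placeholder input data
W = np.zeros((k, n))        # weights, 1 x n
b = np.zeros((1, k))        # bias, 1 x 1

Z = np.dot(X, W.T) + b      # (m x n) . (n x 1) + (1 x 1)
print(Z.shape)              # (569, 1), i.e. m x 1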

Step 3.2 - Compute Activation using equation 6

activations: $A = \sigma(Z) = \frac{1}{1+e^{-Z}}$

Note that the dimensions of the activations are the same as the dimensions of the preactivations.



Step 3.3 - Compute the loss / error / cost using equation 7: $\text{cost} = -\frac{1}{m}\sum_{i=1}^{m} y_i\log(A_i) + (1-y_i)\log(1-A_i)$

It is important to note that the cost/error/loss is a scalar, which we would like to minimize using gradient descent.

Step 4 - Back Propagation

We shall derive the gradients using the chain rule -

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial A} \cdot \frac{\partial A}{\partial Z} \cdot \frac{\partial Z}{\partial W}$


Let's start by computing the partial derivative $\frac{\partial L}{\partial A}$:
$\frac{\partial L}{\partial A} = \frac{\partial}{\partial A}\left[-\sum_{i=1}^{m} y_i\log(A_i) + (1-y_i)\log(1-A_i)\right]$

For one training example i -

$\frac{\partial L}{\partial A} = -\left[y^{(i)}\frac{\partial \log(A)}{\partial A} + (1-y^{(i)})\frac{\partial \log(1-A)}{\partial A}\right] = -\left[\frac{y^{(i)}}{A} - \frac{1-y^{(i)}}{1-A}\right]$


Now let's compute the partial derivative $\frac{\partial A}{\partial Z}$:
$\frac{\partial A}{\partial Z} = \frac{\partial}{\partial Z}\left[\frac{1}{1+e^{-Z}}\right] = \frac{\partial}{\partial Z}\left[(1+e^{-Z})^{-1}\right] = -(1+e^{-Z})^{-2}\,\frac{\partial}{\partial Z}(1+e^{-Z}) = \frac{e^{-Z}}{(1+e^{-Z})^{2}}$


Adding and subtracting 1 in the numerator of equation 10: $\frac{\partial A}{\partial Z} = \frac{1+e^{-Z}-1}{(1+e^{-Z})^{2}} = \frac{1+e^{-Z}}{(1+e^{-Z})^{2}} - \frac{1}{(1+e^{-Z})^{2}} = \frac{1}{1+e^{-Z}} - \frac{1}{(1+e^{-Z})^{2}}$

From equations 6 and 11, we get -

$\frac{\partial A}{\partial Z} = A - A^{2} = A(1-A)$
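A tiny numerical check (illustrative only, not part of the original derivation) that the derivative of the sigmoid really is A(1 - A):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7                                                      # arbitrary point
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
a = sigmoid(z)
print(numeric, a * (1 - a))                                  # the two values closely agree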

Multiplying equations 9 and 12, we get

$\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A} \cdot \frac{\partial A}{\partial Z} = -\left[\frac{y^{(i)}}{A} - \frac{1-y^{(i)}}{1-A}\right] A(1-A) = -\left[y^{(i)}(1-A) - A(1-y^{(i)})\right] = -\left[y^{(i)} - y^{(i)}A - A + y^{(i)}A\right] = A - y^{(i)}$

Now, the last part of this derivation is to compute the gradients with respect to the weights/coefficients.

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W} = \frac{\partial L}{\partial Z}\left[\frac{\partial Z}{\partial W_1}, \frac{\partial Z}{\partial W_2}, \ldots, \frac{\partial Z}{\partial W_n}\right]$

The derivatives $\left[\frac{\partial Z}{\partial W_1}, \frac{\partial Z}{\partial W_2}, \ldots, \frac{\partial Z}{\partial W_n}\right]$ can be easily calculated as follows -

$\frac{\partial Z}{\partial W_1} = \frac{\partial}{\partial W_1}\left(x_1 w_1 + x_2 w_2 + \ldots + x_n w_n + b\right) = x_1$

Similarly, $\frac{\partial Z}{\partial W_2} = x_2,\ \frac{\partial Z}{\partial W_3} = x_3,\ \ldots,\ \frac{\partial Z}{\partial W_n} = x_n$

Now, the gradient of the bias term b is simply $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial b} = (A - Y^{(i)})\,\frac{\partial}{\partial b}\left(x_1 w_1 + x_2 w_2 + \ldots + x_n w_n + b\right) = (A - Y^{(i)})(1)$

Therefore, for one training example, the gradient wrt weights can be concisely computed using -

$\frac{\partial L}{\partial W} = (A - Y^{(i)})\left[x^{(i)}_1, x^{(i)}_2, x^{(i)}_3, \ldots, x^{(i)}_n\right], \qquad \frac{\partial L}{\partial b} = (A - Y^{(i)})$

By extending the computations for one training example to m training examples, we get -

$\frac{\partial L}{\partial W} = \sum_{i=1}^{m}\frac{\partial L^{(i)}}{\partial W}, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{m}\frac{\partial L^{(i)}}{\partial b}$

Equations 13 and 14 can be summarised as follows -

$\frac{\partial L}{\partial W} = \frac{1}{m}(A - Y)^T X, \qquad \frac{\partial L}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\big(A^{(i)} - Y^{(i)}\big)$

Step 4.1 - The weights can be updated using the gradient descent equations as follows (α is the learning rate): $W = W - \alpha\frac{\partial L}{\partial W}, \qquad b = b - \alpha\frac{\partial L}{\partial b}$

In [5]:
def initialize_dimensions(X, Y):
    '''
    This function initializes the values of m, n and k
    m -> number of training examples
    n -> number of features
    k -> number of output labels (for binary classification, k= 1)
    
    '''
    m, n = X.shape
    k = Y.shape[1]
    return m, n, k
In [6]:
def initialize_weights_with_zeros(dim):
    '''
    This function initializes the weights vector and bias vector
    
    The general formulas for the weights matrix and bias vector are - 
    W --> number of output units x number of input features
    b --> 1 x number of output units
    
    '''
    W = np.zeros(shape = (dim))
    b = np.zeros(shape = (1, dim[0]))
    return W, b

The formula for sigmoid is given by -

$\sigma(z) = \frac{1}{1+e^{-z}}$
In [7]:
def sigmoid(z):
    '''
    This function computes the sigmoid of vector (numpy array).
    '''
    return 1.0/(1 + np.exp(-1.0 *z))
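A quick usage check (input values chosen arbitrarily), using the sigmoid defined above:

import numpy as np

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # approximately [0.0067, 0.5, 0.9933]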

The cost for logistic regression is given by -

$\text{cost}\ (L) = -\frac{1}{m}\sum_{i=1}^{m} y_i\log(A_i) + (1-y_i)\log(1-A_i)$

where Y is a vector of size m × 1, and A = activations = σ(z) is a vector of size m × 1.

In [8]:
def compute_sigmoid_cost(Y, A, m):
    # average negative log likelihood (cross-entropy) over the m training examples
    cost = -1.0/m * np.sum((Y * np.log(A) + ((1 - Y) * np.log(1 - A))), axis = 0)
    return cost
In [9]:
def compute_sigmoid_gradients(x, activations, y, m):
    # gradients of the cost with respect to the weights (dw) and the bias (db)
    dw = 1.0/m * np.dot((activations - y).T, x)
    db = 1.0/m * np.sum((activations - y), axis = 0, keepdims = True)
    return dw, db
In [10]:
def propagation(w, b, X, Y, m):
    # FORWARD PROPAGATION (FROM X TO COST)

    z = (np.dot(X, w.T) + b)
    activations = sigmoid(z)
    cost = compute_sigmoid_cost(Y, activations, X.shape[0])
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw, db = compute_sigmoid_gradients(X, activations, Y, m)

    cost = np.squeeze(cost)
    
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost
In [17]:
def train_logistic_regression(X, Y, learning_rate = 0.2, n_epochs = 3000, print_cost = True):
    m,n,k  = initialize_dimensions(X, Y)
    w, b = initialize_weights_with_zeros((k, n))
    
    costs = []
    for iteration in range(n_epochs):
        grad, cost = propagation(w, b, X, Y, m)
        
        dw = grad['dw']
        db = grad['db']
        
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        # Record the costs
        if iteration % 10 == 0:
            costs.append(cost)

        # Print the cost and training accuracy every 500 iterations
        if print_cost and iteration % 500 == 0:
            activations = sigmoid(np.dot(X, w.T) + b)
            y_pred = np.where(activations > 0.5, 1, 0)
            accuracy = (float(np.sum(y_pred[:,0] == Y[:,0]))/ m)* 100
            display("Cost after iteration %i: %f | accuracy after iteration %i: %f" % (iteration, cost, iteration, accuracy))

    # final predictions on the training data
    activations = sigmoid(np.dot(X, w.T) + b)
    y_pred = np.where(activations > 0.5, 1, 0)

    params = {"w": w,
              "b": b}

    grads = {"dw": dw,
             "db": db}

    return params, grads, costs, y_pred
In [18]:
params, grads, costs, Y_pred = train_logistic_regression(X, Y, print_cost = True)
'Cost after iteration 0: 0.693147 | accuracy after iteration 0: 65.026362'
'Cost after iteration 500: 0.219201 | accuracy after iteration 500: 94.376098'
'Cost after iteration 1000: 0.169860 | accuracy after iteration 1000: 95.254833'
'Cost after iteration 1500: 0.147171 | accuracy after iteration 1500: 96.836555'
'Cost after iteration 2000: 0.133237 | accuracy after iteration 2000: 96.836555'
'Cost after iteration 2500: 0.123533 | accuracy after iteration 2500: 96.836555'

Awesome !!!

We were able to achieve 96.83% accuracy using logistic regression.

Confusion Matrix

The Confusion Matrix is a performance measurement for classification algorithms. It provides insight not only into the errors made by the model but also into the types of errors made.
Some of the basic terminologies related to Confusion Matrix are as follows :

* True Positive — Label that was predicted Positive and is actually Positive.
* True Negative — Label that was predicted Negative and is actually Negative.
* False Positive (Type 1 Error) — Label that was predicted Positive but is actually Negative.
* False Negative (Type 2 Error) — Label that was predicted Negative but is actually Positive.

Other Metrics that can be computed using Confusion Matrix are :

  1. Sensitivity - Also called Recall, Hit Rate, or True Positive Rate (TPR). It measures the proportion of actual positives that are correctly identified. $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$


  2. Specificity - Also called Selectivity or True Negative Rate (TNR). It measures the proportion of actual negatives that are correctly identified. $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$


  3. Precision - Also called the Positive Predictive Value. The precision metric shows the accuracy of the positive class: it measures how likely a prediction of the positive class is to be correct. $\text{Precision} = \frac{TP}{TP + FP}$


  4. Accuracy Score - Accuracy is the number of correct predictions divided by the total size of the dataset. $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$


  5. F1 Score - The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. $F1\ \text{Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$


In [80]:
"""
Here we treat Y = 1 as the positive class and Y = 0 as the negative class.

"""
def plot_confusion_matrix(actual, predicted):
    # rows = actual class (1, 0); columns = predicted class (1, 0)
    confusion_matrix = np.zeros((2, 2))
    confusion_matrix[0][0] = int(((predicted == 1) & (actual == 1)).sum())  # true positives
    confusion_matrix[0][1] = int(((predicted == 0) & (actual == 1)).sum())  # false negatives
    confusion_matrix[1][0] = int(((predicted == 1) & (actual == 0)).sum())  # false positives
    confusion_matrix[1][1] = int(((predicted == 0) & (actual == 0)).sum())  # true negatives

    return confusion_matrix
In [83]:
confusion_matrix=plot_confusion_matrix(Y,Y_pred)
TP = confusion_matrix[0][0]
FN = confusion_matrix[0][1]
FP = confusion_matrix[1][0]
TN = confusion_matrix[1][1]
recall=TP/(TP+FN)
specificity = TN/(TN+FP)
precision=TP/(TP+FP)
accuracy= (TP+TN)/(TP+TN+FP+FN)
f1_score=(2*precision*recall)/(precision+recall)


print("The Confusion Matrix of the output is :")
print(confusion_matrix)
print("The Sensitivity is : ", recall)
print("The Specificity is : ", specificity)
print("The Precision is : ", precision)
print("The Accuracy is : ", accuracy)
print("The F1 Score is : ", f1_score)
The Confusion Matrix of the output is :
[[ 353.    4.]
 [  14.  198.]]
('The Sensitivity is : ', 0.98879551820728295)
('The Specificity is : ', 0.93396226415094341)
('The Precision is : ', 0.96185286103542234)
('The Accuracy is : ', 0.96836555360281196)
('The F1 Score is : ', 0.97513812154696133)
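As an optional cross-check (not part of the original notebook), the same numbers can be verified against sklearn's built-in metrics, assuming Y and Y_pred from the cells above:

from sklearn import metrics

y_true = Y.ravel()
y_hat = Y_pred.ravel()

# note: sklearn orders its confusion matrix as [[TN, FP], [FN, TP]] for labels (0, 1)
print(metrics.confusion_matrix(y_true, y_hat))
print(metrics.recall_score(y_true, y_hat))     # sensitivity
print(metrics.precision_score(y_true, y_hat))
print(metrics.accuracy_score(y_true, y_hat))
print(metrics.f1_score(y_true, y_hat))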
Congratulations on Completing Logistic Regression !!!


Can we use Linear Regression to solve the classification problem?

A linear regression studies the relationship between
  • a response variable Y and
  • a single explanatory variable X.
However, Linear Regression doesn't seem to be a good fit for classification problems, owing to the following reasons.



  • Linear Regression assumes normality of the residual errors (which represent the variation in Y), whereas the outcome of a classification problem is not normally distributed.

  • The variance (and the standard deviation) in linear regression does not depend on X (the input variable).
    For classification problems, the mean and the variance depend on the probability. Thus any input that affects the probability of the output also affects the mean and the variance. This violates the assumptions of linear regression.

  • Linear Regression deals with continuous variables instead of discrete variables. This may result in an output with a value less than 0 or greater than 1.
    A true probability has a value between 0 and 1.
    Thus Linear Regression fails to model a true probability.

    • We could try to solve the problem of unbounded output by letting log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount.
      The problem here is that logarithms are unbounded in only one direction, while linear functions are not.
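As an illustrative sketch (not from the original post), fitting ordinary least squares to the binary labels of this same dataset shows the predictions escaping the [0, 1] range:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

data = load_breast_cancer()
X_lin = MinMaxScaler().fit_transform(data['data'])
y_lin = data['target']

lin_reg = LinearRegression().fit(X_lin, y_lin)
preds = lin_reg.predict(X_lin)

# typically some fitted "probabilities" fall below 0 or above 1,
# which a true probability never can
print(preds.min(), preds.max())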